---
layout: post
title: Classification of Sport Activities
---

BACKGROUND

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

DATA

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.

WHAT I WILL DO

The goal of this study is to predict the manner in which the participants did the exercise. This is the “classe” variable in the training set. Any of the other variables may be used as predictors. I will explain step by step how I built my model, how I used cross-validation, what I think the expected out-of-sample error is, and why I made the choices I did. I will also use the selected prediction model to predict 20 different test cases at the end of the study.

DATA PREPARATION

After preprocessing the data we will have three data sets on hand: training, testing, and pml_testing. I will use training to train the models. The testing set will be used for model selection (validation). The pml_testing set will be used for making the final predictions with the selected model.

# load the packages used in this analysis
# (bagging() is assumed to come from ipred; adabag also exports one)
library(caret)
library(plotly)
library(earth)
library(ipred)
library(C50)

setwd("C:/Users/Acer-nb/Downloads/ML_Project")

# read csv data with blank and NaN values treated as NA
pml_training <- read.csv("pml-training.csv",na.strings=c("NA","NaN", " ",""))
pml_testing <- read.csv("pml-testing.csv",na.strings=c("NA","NaN", " ",""))
# remove columns with na values
pml_train <- pml_training[,!colSums(is.na(pml_training))>0]
pml_test <- pml_testing[,!colSums(is.na(pml_testing))>0] 

# drop identifier, timestamp, and window columns -- they are not sensor readings
dropcl <- grep("name|timestamp|window|X", colnames(pml_train), value=F) 
pml_training <- pml_train[,-dropcl]
dropcl <- grep("name|timestamp|window|X", colnames(pml_test), value=F) 
pml_testing <- pml_test[,-dropcl]

# split the cleaned data 80/20 into a training set and a validation (testing) set
set.seed(1234)
inTrain <- createDataPartition(y=pml_training$classe,p=0.8, list=FALSE)
training <- pml_training[inTrain,]
testing <- pml_training[-inTrain,]

EXPLORING DATA

After removing the unneeded columns, 52 predictors remain in the data. This is quite a lot, so we may not need all of them in our model.

# Make a matrix of correlations of all predictors
M <- abs(cor(training[,-53]))

# Set the diagonal to zero (each predictor's correlation with itself is 1, which we want to ignore)
diag(M) <- 0

# Find pairs of predictors with absolute correlation above the 0.8 threshold
which(M > 0.8,arr.ind=T)
##                  row col
## yaw_belt           3   1
## total_accel_belt   4   1
## accel_belt_y       9   1
## accel_belt_z      10   1
## accel_belt_x       8   2
## magnet_belt_x     11   2
## roll_belt          1   3
## roll_belt          1   4
## accel_belt_y       9   4
## accel_belt_z      10   4
## pitch_belt         2   8
## magnet_belt_x     11   8
## roll_belt          1   9
## total_accel_belt   4   9
## accel_belt_z      10   9
## roll_belt          1  10
## total_accel_belt   4  10
## accel_belt_y       9  10
## pitch_belt         2  11
## accel_belt_x       8  11
## gyros_arm_y       19  18
## gyros_arm_x       18  19
## magnet_arm_x      24  21
## accel_arm_x       21  24
## magnet_arm_z      26  25
## magnet_arm_y      25  26
## accel_dumbbell_x  34  28
## accel_dumbbell_z  36  29
## gyros_dumbbell_z  33  31
## gyros_forearm_z   46  31
## gyros_dumbbell_x  31  33
## gyros_forearm_z   46  33
## pitch_dumbbell    28  34
## yaw_dumbbell      29  36
## gyros_forearm_z   46  45
## gyros_dumbbell_x  31  46
## gyros_dumbbell_z  33  46
## gyros_forearm_y   45  46

As seen in the results, several variables are highly correlated, so not all of them need to be in the model. Feature selection should be performed, and the smaller set of selected features should be used to construct the models.
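The pair-finding logic above can be checked on a small synthetic data frame (toy data, not the project data) where one column is built to be nearly collinear with another:

```r
# Toy data: x2 is built to be nearly collinear with x1; x3 is independent.
set.seed(1)
x1 <- rnorm(100)
toy <- data.frame(x1 = x1,
                  x2 = x1 + rnorm(100, sd = 0.1),
                  x3 = rnorm(100))

M <- abs(cor(toy))   # absolute pairwise correlations
diag(M) <- 0         # ignore each variable's correlation with itself
high <- which(M > 0.8, arr.ind = TRUE)
rownames(high)       # only the collinear x1/x2 pair is flagged (in both directions)
```

As in the project output, each correlated pair shows up twice, once per direction, which is why roll_belt and friends appear repeatedly above.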

The following exploratory plots show that some features separate the classes quite well, so we may expect high accuracy levels when predicting the outcome.

# box plots of the first and third predictors (roll_belt and yaw_belt) by class
plot_ly(training,color =training$classe,y=training[,1],type="box")
plot_ly(training,color =training$classe,y=training[,3],type="box")

MODELS - PCA

The following models were fit on 25 features extracted by PCA from the data. Even after this kind of dimensionality reduction, the random forest model takes too long to run.

# Create as many components as required to explain 95% of the variance

preProc <- preProcess((training[,-53]+1),method=c("center","scale","pca"),thresh = 0.95)
trainPC <- predict(preProc,(training[,-53]+1))
testPC <- predict(preProc,(testing[,-53]+1))
mod1 <- train(x=trainPC, y=training$classe, method="lda")
mod2 <- train(x=trainPC, y=training$classe, method="knn")
pred1 <- predict(mod1,testPC)
pred2 <- predict(mod2,testPC)
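The 95% threshold behaviour can also be reproduced with base R's prcomp, which is essentially what caret's "pca" preprocessing does under the hood. A toy sketch on synthetic data (not the project data):

```r
# Synthetic data: 10 columns, one of which duplicates another with small noise.
set.seed(2)
toy <- matrix(rnorm(200 * 10), ncol = 10)
toy[, 2] <- toy[, 1] + rnorm(200, sd = 0.05)

# Center/scale, run PCA, and keep enough components for 95% of the variance.
pc <- prcomp(toy, center = TRUE, scale. = TRUE)
var_explained <- cumsum(pc$sdev^2) / sum(pc$sdev^2)
n_comp <- which(var_explained >= 0.95)[1]
n_comp   # fewer than 10 components are needed, since column 2 is redundant
```

The same mechanism explains why caret kept 25 components out of 52 predictors above: the correlated sensor channels collapse into shared components.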

Even after simplifying with PCA, the models take too long to run, so I tried another method to make them simpler.
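One likely culprit (my assumption, not verified here) is caret's default resampling: train() bootstraps 25 times, refitting the model each time. Switching to k-fold cross-validation via trainControl cuts the number of fits. The fold bookkeeping itself is simple base R:

```r
# Assign each of n rows to one of k roughly equal, randomly shuffled folds.
set.seed(3)
k <- 5
n <- 100
folds <- sample(rep(1:k, length.out = n))
table(folds)   # 20 rows per fold

# With caret, the equivalent setting would be (sketch, not run here):
# ctrl <- trainControl(method = "cv", number = k)
# train(classe ~ ., data = training, method = "rf", trControl = ctrl)
```

Five folds means five model fits instead of 25 bootstrap fits, a roughly 5x speedup per call to train().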

MODELS - EARTH PACKAGE

The earth package computes variable importance based on generalized cross-validation (GCV), the number of model subsets in which a variable occurs (nsubsets), and the residual sum of squares (RSS). I tried this method on the data as follows:

marsModel <- earth(classe~., data=training)
ev <- evimp (marsModel,trim = FALSE)
ev
##                            nsubsets   gcv    rss
## roll_belt                        21 100.0  100.0
## magnet_dumbbell_y                20  90.6   90.6
## roll_forearm                     19  85.6   85.6
## accel_belt_z                     17  72.8   72.9
## magnet_dumbbell_z                16  70.0   70.1
## yaw_belt                         15  66.0   66.1
## roll_dumbbell                    13  56.2   56.4
## total_accel_dumbbell             12  51.9   52.1
## pitch_belt                       10  43.5   43.7
## pitch_forearm                     7  31.9   32.1
## total_accel_belt-unused           0   0.0    0.0
## gyros_belt_x-unused               0   0.0    0.0
## gyros_belt_y-unused               0   0.0    0.0
## gyros_belt_z-unused               0   0.0    0.0
## accel_belt_x-unused               0   0.0    0.0
## accel_belt_y-unused               0   0.0    0.0
## magnet_belt_x-unused              0   0.0    0.0
## magnet_belt_y-unused              0   0.0    0.0
## magnet_belt_z-unused              0   0.0    0.0
## roll_arm-unused                   0   0.0    0.0
## pitch_arm-unused                  0   0.0    0.0
## yaw_arm-unused                    0   0.0    0.0
## total_accel_arm-unused            0   0.0    0.0
## gyros_arm_x-unused                0   0.0    0.0
## gyros_arm_y-unused                0   0.0    0.0
## gyros_arm_z-unused                0   0.0    0.0
## accel_arm_x-unused                0   0.0    0.0
## accel_arm_y-unused                0   0.0    0.0
## accel_arm_z-unused                0   0.0    0.0
## magnet_arm_x-unused               0   0.0    0.0
## magnet_arm_y-unused               0   0.0    0.0
## magnet_arm_z-unused               0   0.0    0.0
## pitch_dumbbell-unused             0   0.0    0.0
## yaw_dumbbell-unused               0   0.0    0.0
## gyros_dumbbell_x-unused           0   0.0    0.0
## gyros_dumbbell_y-unused           0   0.0    0.0
## gyros_dumbbell_z-unused           0   0.0    0.0
## accel_dumbbell_x-unused           0   0.0    0.0
## accel_dumbbell_y-unused           0   0.0    0.0
## accel_dumbbell_z-unused           0   0.0    0.0
## magnet_dumbbell_x-unused          0   0.0    0.0
## yaw_forearm-unused                0   0.0    0.0
## total_accel_forearm-unused        0   0.0    0.0
## gyros_forearm_x-unused            0   0.0    0.0
## gyros_forearm_y-unused            0   0.0    0.0
## gyros_forearm_z-unused            0   0.0    0.0
## accel_forearm_x-unused            0   0.0    0.0
## accel_forearm_y-unused            0   0.0    0.0
## accel_forearm_z-unused            0   0.0    0.0
## magnet_forearm_x-unused           0   0.0    0.0
## magnet_forearm_y-unused           0   0.0    0.0
## magnet_forearm_z-unused           0   0.0    0.0

According to the model, only 10 of the variables have an effect on the outcome, so I subset the data to these 10 variables.

training_imp <- subset(training,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
testing_imp <- subset(testing,select = c(classe,roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))
pml_testing_imp <- subset(pml_testing,select = c(roll_belt,magnet_dumbbell_y,roll_forearm,accel_belt_z,magnet_dumbbell_z,yaw_belt,roll_dumbbell,total_accel_dumbbell,pitch_belt,pitch_forearm))

I built 4 models with this subset.

mod11 <- train(classe~.,data=training_imp,method="rf",preProcess=c("center","scale"))
# bagging() and C5.0() are not caret wrappers and silently ignore a preProcess
# argument, so they are fit on the raw features
mod12 <- bagging(classe~.,data=training_imp)
mod13 <- C5.0(classe~.,data=training_imp)
mod14 <- train(classe~.,data=training_imp,method="knn",preProcess=c("center","scale"))
pred11 <- predict(mod11,testing_imp)
pred12 <- predict(mod12,testing_imp)
pred13 <- predict(mod13,testing_imp)
pred14 <- predict(mod14,testing_imp)
confusionMatrix(predict(mod14,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1096   30    5    2    0
##          B    9  671   18    7   10
##          C    6   23  629   26   12
##          D    3   31   26  605   12
##          E    2    4    6    3  687
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9401          
##                  95% CI : (0.9322, 0.9473)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9242          
##  Mcnemar's Test P-Value : 2.215e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.8841   0.9196   0.9409   0.9528
## Specificity            0.9868   0.9861   0.9793   0.9780   0.9953
## Pos Pred Value         0.9673   0.9385   0.9037   0.8936   0.9786
## Neg Pred Value         0.9928   0.9726   0.9830   0.9883   0.9894
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2794   0.1710   0.1603   0.1542   0.1751
## Detection Prevalence   0.2888   0.1823   0.1774   0.1726   0.1789
## Balanced Accuracy      0.9844   0.9351   0.9495   0.9595   0.9741
confusionMatrix(predict(mod13,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1090   11    8    1    1
##          B   21  721    8    3    6
##          C    4   15  651   11    6
##          D    1    8   14  626    3
##          E    0    4    3    2  705
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9669          
##                  95% CI : (0.9608, 0.9722)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9581          
##  Mcnemar's Test P-Value : 0.2972          
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9767   0.9499   0.9518   0.9736   0.9778
## Specificity            0.9925   0.9880   0.9889   0.9921   0.9972
## Pos Pred Value         0.9811   0.9499   0.9476   0.9601   0.9874
## Neg Pred Value         0.9908   0.9880   0.9898   0.9948   0.9950
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2778   0.1838   0.1659   0.1596   0.1797
## Detection Prevalence   0.2832   0.1935   0.1751   0.1662   0.1820
## Balanced Accuracy      0.9846   0.9690   0.9703   0.9828   0.9875
confusionMatrix(predict(mod12,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1106    8    2    0    3
##          B    3  732    4    2    5
##          C    5   10  672    6    3
##          D    1    7    6  635    3
##          E    1    2    0    0  707
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9819          
##                  95% CI : (0.9772, 0.9858)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2e-16         
##                                           
##                   Kappa : 0.9771          
##  Mcnemar's Test P-Value : 0.05179         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9910   0.9644   0.9825   0.9876   0.9806
## Specificity            0.9954   0.9956   0.9926   0.9948   0.9991
## Pos Pred Value         0.9884   0.9812   0.9655   0.9739   0.9958
## Neg Pred Value         0.9964   0.9915   0.9963   0.9976   0.9956
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2819   0.1866   0.1713   0.1619   0.1802
## Detection Prevalence   0.2852   0.1902   0.1774   0.1662   0.1810
## Balanced Accuracy      0.9932   0.9800   0.9875   0.9912   0.9898
confusionMatrix(predict(mod11,testing_imp),testing_imp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    3    1    0    0
##          B    2  743    5    0    2
##          C    2    7  677    4    0
##          D    0    6    1  639    1
##          E    0    0    0    0  718
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9913         
##                  95% CI : (0.9879, 0.994)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.989          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9964   0.9789   0.9898   0.9938   0.9958
## Specificity            0.9986   0.9972   0.9960   0.9976   1.0000
## Pos Pred Value         0.9964   0.9880   0.9812   0.9876   1.0000
## Neg Pred Value         0.9986   0.9950   0.9978   0.9988   0.9991
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2835   0.1894   0.1726   0.1629   0.1830
## Detection Prevalence   0.2845   0.1917   0.1759   0.1649   0.1830
## Balanced Accuracy      0.9975   0.9880   0.9929   0.9957   0.9979

All models performed quite satisfactorily, but the winner was mod11, built with random forest, with 0.99 accuracy.

Here are its results on the small test data of 20 observations:

predict(mod11,pml_testing_imp)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

All test activities are predicted correctly, so the model works well.

DISCUSSION

The classes in the testing data are predicted with 99% accuracy, which is almost a perfect score. The other performance metrics are also very high. This means the out-of-sample error rate is very small, but a question arises here: is there an over-fitting issue? There should not be. The number of observations is quite sufficient, and I do not expect to observe very different variation in real life.
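The expected out-of-sample error can be read straight off the held-out confusion matrix: it is one minus the validation-set accuracy. For the random forest (mod11) the arithmetic is:

```r
# Out-of-sample error estimate from the validation-set accuracy of mod11.
acc <- 0.9913               # accuracy reported in the confusion matrix above
oos_error <- 1 - acc
round(oos_error * 100, 2)   # about 0.87% expected out-of-sample error
```

Because the testing set was held out from model fitting (only used for model selection), this estimate may be slightly optimistic, but it should be close to the true error rate.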